Altus Safety Automation · PoC architecture

How it's put together — and why

A walkthrough of the system that takes a signed-off site audit, has Claude cross-reference it against the job's retest sheet, drafts the failure report and audit summary, and lands the finished documents in SharePoint — with a human approval gate before anything leaves the building.

§ 01 The cloud: Azure · where it all runs

Everything runs in one Azure resource group in the UK South region. The choices favour managed, serverless, pay-as-you-go services so the PoC carries no fixed monthly minimum — if no audits arrive in a day, compute cost is effectively zero.

Compute

Container Apps

Every container in the system runs here. Python services scale to zero when idle — n8n stays warm (one replica, always on) because it owns the cron triggers and webhook endpoints.

Why: right-sized for the workload — bursty Python services pay only when working, while the orchestrator is always listening.

Database

Postgres Flexible Server

Smallest tier (B1ms). Holds audit-trail rows, workflow state, and the replay cache. 7-day point-in-time recovery.

Why: managed Postgres — no patching, no backups to script.

Image registry

Container Registry

One private registry holds every service image. Pulls authenticated by managed identity.

Why: integrated with Container Apps; no Docker Hub rate-limit pain.

Secrets

Key Vault

API keys (Anthropic, Microsoft Graph), database connection string, JWT signing keys. Never in source code or env files.

Why: rotation, audit, RBAC — standard for any production-bound posture.

Observability

Application Insights

Receives OpenTelemetry traces from every service. Backs the "Altus agent runs" Workbook and Teams alert rules.

Why: Microsoft-native, no separate vendor, queries via KQL.

Identity

Entra ID + UAMIs

Each service has a user-assigned managed identity (UAMI). Database, Key Vault, and Container Registry access are token-based, with no shared passwords.

Why: passwordless service-to-service auth; revoke an identity, you revoke its access.

Networking

Container Apps environment + NAT

All services share one VNet. Outbound traffic egresses through a NAT gateway with a known IP — useful for vendor IP allow-lists.

Why: single security perimeter; predictable egress.

Infrastructure-as-code

Bicep

Every Azure resource is declared in Bicep files in the repo. Provisioning is a single command. No click-ops.

Why: reproducible, reviewable, Microsoft-native (no Terraform vendor).

Storage & doc store

SharePoint Online

Not strictly Azure compute, but the same Microsoft tenant. Holds the client-folder structure and the coordinator approval list.

Why: coordinators already live in M365 — no new app to learn.

Likely running cost — for the PoC volume (tens of audits per week, not hundreds): ~£60–£100/month on Azure, plus Anthropic API spend governed by the £20/day cap (typically £50–£200/month in practice). Largest line items: Postgres (~£25/mo), n8n always-on container (~£15/mo), Python services on-demand compute (~£5–£15/mo combined), Application Insights ingestion (~£5–£15/mo). ACR, Key Vault, Storage round to a couple of pounds each. Azure costs are paid directly by Altus on their subscription — this engagement covers the build, not the running tab.
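
As a quick sanity check on the estimate, the itemised figures can be summed. The sketch below uses only the line items quoted above. Treating "a couple of pounds each" as a flat £2 for ACR, Key Vault and Storage is an assumption, and the dictionary keys are illustrative names, not resource names from the deployment.

```python
# Monthly Azure line items in GBP as (low, high) ranges, taken from the estimate above.
# Assumption: "a couple of pounds each" is treated as a flat 2 GBP for ACR, Key Vault, Storage.
LINE_ITEMS = {
    "postgres_flexible_server": (25, 25),
    "n8n_always_on_container": (15, 15),
    "python_services_on_demand": (5, 15),
    "application_insights_ingestion": (5, 15),
    "container_registry": (2, 2),
    "key_vault": (2, 2),
    "storage": (2, 2),
}


def itemised_range(items):
    """Return the (low, high) sum of the itemised monthly costs."""
    low = sum(lo for lo, _ in items.values())
    high = sum(hi for _, hi in items.values())
    return low, high


low, high = itemised_range(LINE_ITEMS)
print(f"Itemised Azure subtotal: £{low}-£{high} per month")
```

Under these assumptions the itemised subtotal comes to £56 to £76 per month, which sits inside the quoted range and leaves headroom for anything not broken out in the list. Anthropic API spend is separate and governed by the daily cap.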

§ 02 At a glance · today, in staging

The numbers behind the slide deck. This is one stack, three small services, one skill, one approval surface — deliberately small so a single technical person can build it, hand it over, and walk away from it.

  • AI runner: 1 (agent-svc-skilled; Anthropic Platform, hosted skill)
  • Skills in library: 1 (Audit-review v1, hosted by Anthropic, source in git)
  • Stage 3.5: Live (cache + per-iteration observability merged, smoke green)
  • Document templates: 2 (audit summary and failure report, both editable Word files)
  • Independent services: 3 (agent-svc-skilled, docgen-svc, fallback-svc; plus approval-app)
  • Workflow surface: n8n (low-code, every box readable by a power user)
  • Cost cap: £20/day (circuit-breaks all paths if breached)
  • Cloud surface: Azure (UK South, Container Apps, Postgres, SharePoint, App Insights)

§ 03 The story · why this exists

An Altus safety engineer signs off an audit on iAuditor. Today an offshore admin team copy-pastes that audit against the job's retest sheet, classifies each asset, writes the failure-report narrative for anything flagged, and produces two Word documents. It is slow, inconsistent, and the source of most quality issues that reach the client. This system removes the offshore step, keeps the human approval, and aims for clean documents in minutes.

The trigger

An audit is signed off

The trigger is a site visit being closed and signed off in iAuditor (SafetyCulture). The system polls for newly signed-off audits every 15 minutes; a SharePoint drop-folder is the fallback if the API is unavailable.

The reasoning

Cross-reference & draft

Claude reads the audit and the retest summary sheet, matches assets, classifies pass / pass-with-recommendation / fail / not-tested, and drafts narrative for every flagged item — with verbatim citations back to the audit.

The output

Two Word docs, gated

Audit Summary + Failure Report land in the client's SharePoint folder. A coordinator opens them in Word, edits if needed, and approves in a SharePoint list. Nothing leaves Altus without a name on it.
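
The classification step described under "The reasoning" can be pictured as a small structured record per asset, plus a mechanical check that every citation really is verbatim. This is a minimal sketch, not the service's actual schema: the four outcome values come from the text above, while `AssetFinding`, its field names, and the rule that "flagged" means anything other than a clean pass are assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    PASS = "pass"
    PASS_WITH_RECOMMENDATION = "pass-with-recommendation"
    FAIL = "fail"
    NOT_TESTED = "not-tested"


@dataclass(frozen=True)
class AssetFinding:
    # Hypothetical field names; the real JSON schema lives in the skills repo.
    asset_id: str
    outcome: Outcome
    narrative: str   # drafted text; empty for unflagged items
    citation: str    # quote that must appear verbatim in the audit text


def is_flagged(finding):
    """Assumption: an item is flagged when it is anything other than a clean pass."""
    return finding.outcome is not Outcome.PASS


def unsupported_citations(findings, audit_text):
    """Return flagged findings whose citation is empty or not found verbatim in the audit."""
    return [
        f for f in findings
        if is_flagged(f) and (not f.citation or f.citation not in audit_text)
    ]
```

A check like this is cheap to run before anything reaches the coordinator: a flagged item whose quote cannot be found in the audit text is a signal to hold the draft back rather than pass it on.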

§ 04 The shape of the system · topology

Three small services, one orchestrator, one source of truth for skills, one approval surface. Connections are deliberately few. Every solid line is a place where a power user can see, or change, behaviour.

[Diagram: system topology]

Nodes:

  • iAuditor / SafetyCulture (external trigger): audit sign-off
  • GitHub, altus-safety-skills repo: skill source of truth
  • Anthropic Platform (external SaaS): hosted skill + Claude API
  • n8n orchestrator (Azure Container App, always-on): workflow engine, low-code
  • agent-svc-skilled (Azure Container App): AI runner; secrets from Azure Key Vault
  • docgen-svc (Azure Container App): Word + SharePoint upload; secrets from Azure Key Vault
  • fallback-svc (Azure Container App): manual ingestion
  • approval-app (Azure Container App): coordinator UI
  • Postgres Flex (Azure managed DB): state + audit trail
  • Application Insights (Azure observability): Workbook 'Altus agent runs'
  • SharePoint Online (Microsoft 365 tenant): deliverables, drop folder, approval list
  • Teams (Microsoft 365 tenant): #altus-ops-alerts
  • Coordinator (Altus employee): human approval gate

Edge labels: drop folder, CI sync on PR merge, traces, health pings.

Boxes inside grouped frames live inside that Azure / M365 surface. Thick arrows = main flow. Dotted arrows = secrets, traces, alerts, CI sync.

Legend: n8n orchestrator, Azure Container Apps, Azure managed data (Postgres, KV), Azure observability (App Insights), Microsoft 365 tenant, External SaaS (Anthropic), External trigger, Human.

§ 05 What lives where · components, in plain terms

Each box on the diagram is a deliberate choice about where work happens and who can touch it. The principle: keep AI in the places that need reasoning, keep everything else mechanical and editable.

Orchestrator

n8n — the low-code conductor

Every workflow — "when an audit is signed off, do X then Y then Z" — lives here as a visual graph. The ops lead can open it in a browser, read the boxes, change the schedule, edit a query.

  • Polls iAuditor, fans out HTTP calls, branches on conditions
  • Owns the SharePoint approval poll
  • Hosts a heartbeat workflow that pings every service every 5 min

Why low-code? A handover-friendly tier: change behaviour without writing Python.
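
The heartbeat workflow feeds the three-tier Teams alerts (info, warning, action). One way to map "time since the last green heartbeat" onto those tiers is sketched below. The 5-minute interval comes from the text above; the two thresholds and the function names are assumptions, since the document does not say when an alert escalates.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(minutes=5)  # from the heartbeat workflow described above
WARNING_AFTER_MISSED = 2                   # assumption: not specified in this document
ACTION_AFTER_MISSED = 4                    # assumption: not specified in this document


def missed_heartbeats(last_green, now):
    """Number of whole heartbeat intervals elapsed since the last successful ping."""
    elapsed = now - last_green
    return max(0, elapsed // HEARTBEAT_INTERVAL)


def severity(last_green, now):
    """Map the number of missed heartbeats onto the three alert tiers."""
    missed = missed_heartbeats(last_green, now)
    if missed >= ACTION_AFTER_MISSED:
        return "action"
    if missed >= WARNING_AFTER_MISSED:
        return "warning"
    return "info"
```

In n8n the same comparison would sit in a condition node; expressing it as a pure function makes the thresholds easy to review and change.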

Reasoning

agent-svc-skilled — the AI runner

A small Python service. Takes the audit + retest sheet, asks Claude (using the hosted skill) to cross-reference and draft. Streams the structured JSON back to n8n.

  • Calls Anthropic /v1/messages with the hosted skill ID
  • Logs every call (cost, tokens, cache hits, iterations) to Postgres
  • Cost cap, circuit breaker, replay cache — all enforced server-side
  • Single, focused service — one job, done well
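
The server-side cost cap can be sketched as a check-then-record pair around each model call. This is an illustration under stated assumptions, not the service's code: the £20 figure is from this document, while `DailyCostBreaker`, `CostCapExceeded`, and the in-memory dictionary standing in for the Postgres cost ledger are hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from decimal import Decimal

DAILY_CAP_GBP = Decimal("20.00")  # the £20/day cap stated in this document


class CostCapExceeded(RuntimeError):
    """Raised instead of calling the model once the daily cap is reached."""


@dataclass
class DailyCostBreaker:
    cap: Decimal = DAILY_CAP_GBP
    # In-memory stand-in for the per-day cost ledger kept in Postgres.
    spend_by_day: dict = field(default_factory=dict)

    def spent(self, day):
        return self.spend_by_day.get(day, Decimal("0"))

    def check(self, day):
        """Call before each model request; refuses once the day's spend has reached the cap."""
        if self.spent(day) >= self.cap:
            raise CostCapExceeded(f"daily cap of £{self.cap} reached on {day.isoformat()}")

    def record(self, day, cost_gbp):
        """Call after each model request with its actual cost."""
        self.spend_by_day[day] = self.spent(day) + cost_gbp
```

Because cost is only known after a call returns, a breaker of this shape can overshoot the cap by at most one call before it trips. It resets naturally when the date changes.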

Doc generation

docgen-svc — Word + SharePoint

Takes the structured JSON from the agent and merges it into Word templates (audit_summary, failure_report). Uploads via Microsoft Graph to the right client folder.

  • No AI — pure template merge
  • Templates are editable Word documents in git
  • Same service will produce RAMS + reminder emails in Phase 2

Fallback

fallback-svc — the manual path

If iAuditor is down or a one-off PDF arrives, an admin drops the file into a SharePoint folder. This service watches that folder and feeds the same pipeline.

  • Hash-based de-duplication
  • Parses the retest summary sheet via openpyxl
  • Quarterly drill keeps it warm
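
Hash-based de-duplication means a file dropped twice, even under a different name, is processed once. A minimal sketch follows, assuming SHA-256 over the file bytes; the document only says "hash-based", so the algorithm, the class name, and the in-memory set are illustrative choices.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """SHA-256 hex digest of the file contents (the file name is deliberately ignored)."""
    return hashlib.sha256(data).hexdigest()


class DropFolderDeduper:
    def __init__(self):
        # In-memory stand-in; a real service would persist seen hashes.
        self._seen = set()

    def is_new(self, data: bytes) -> bool:
        """Return True the first time these exact bytes are seen, False afterwards."""
        digest = content_hash(data)
        if digest in self._seen:
            return False
        self._seen.add(digest)
        return True
```

Hashing content rather than file names is what makes a re-upload of the same export harmless.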

Skill library

altus-safety-skills — the git repo

The source of truth for every skill. Plain markdown (SKILL.md) + JSON schema. A second repo — deliberately separate from the platform code so a power user can edit a skill without touching infrastructure.

  • CI auto-uploads to Anthropic on every merge
  • Every PR runs the 6-fixture eval against the real Anthropic API
  • Power users edit markdown, not Python

Approval gate

approval-app — coordinator view

A small web app (and a SharePoint list, today) that surfaces every AI draft to a coordinator. They open the Word doc, edit if needed, tick approved.

  • Read-only audit-trail page at /obs/reasoning-calls
  • The system never sends a client document unapproved
  • Coordinators stay in M365 where they already live

State

Postgres — audit trail & cache

One small database, three load-bearing tables: workflow_runs, audits, reasoning_calls. Every Claude call is recorded, in full, for the safety regulator angle.

  • Replay cache — same input ⇒ same output, no second API call
  • Cost ledger by day, per path
  • Iteration-level traces from Stage 3.5
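
The replay cache ("same input ⇒ same output, no second API call") can be sketched as a lookup keyed on a hash of the canonicalised input. This is an assumption-laden illustration: including the skill version in the key, the `ReplayCache` class, and the dictionary standing in for a Postgres table are not taken from the source.

```python
import hashlib
import json


def cache_key(skill_version: str, payload: dict) -> str:
    """Deterministic key: the same skill version and input always hash to the same value."""
    # sort_keys makes the key independent of dictionary ordering.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(f"{skill_version}\n{canonical}".encode("utf-8")).hexdigest()


class ReplayCache:
    def __init__(self):
        self._store = {}    # in-memory stand-in for a Postgres table
        self.api_calls = 0  # counts cache misses, i.e. real model calls

    def run(self, skill_version, payload, call_model):
        """Return the cached result for this input, calling the model only on a miss."""
        key = cache_key(skill_version, payload)
        if key not in self._store:
            self.api_calls += 1
            self._store[key] = call_model(payload)
        return self._store[key]
```

Keying on the skill version as well as the input means an edited skill produces fresh output instead of replaying a result drafted under the old prompt.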

§ 06 Why one focused runner · deliberate simplicity

A single AI runner — agent-svc-skilled — is the only path real audit traffic crosses. One service to operate, one to deploy, one to understand. The Anthropic Platform's Skills API gives us hosted prompts that scale without per-call bloat, and the skill itself lives in a separate git repo so a power user can edit it without touching the runner.

| What this gives us | How |
|---|---|
| One thing to operate (no path-selection logic, no fallback wiring) | n8n calls one URL. One Container App. One image to rebuild. The audit trail in reasoning_calls has one shape. |
| Skills scale cheaply (progressive disclosure on the Platform) | As the skill library grows (RAMS, sanity, etc.), only the metadata is loaded per call. Token cost stays flat. |
| Power-user edits, safely (skill source in a separate git repo) | Editing the prompt is a PR against altus-safety-skills. CI runs the 6-fixture eval against real Anthropic before merge. |
| Provider exit is not free, but possible (if Anthropic relationship ever sours) | The skill body and schema are in git, not in Anthropic. A future swap to a Messages-API-only path (or another provider) is a multi-day rebuild, not a multi-week one. |

§ 07 Three design pillars · non-negotiables

Every decision in this architecture is in service of three commitments. If a change would weaken one, it's the wrong change.

i.

Low-code visibility

An operations lead with no Python should be able to read every workflow in n8n, every skill in the git repo, every Word template — and edit them with help from any competent low-code developer. Hand-over readiness is a first-class requirement, not an afterthought.

ii.

AI in the right places only

The skills system is reserved for tasks that need reasoning: cross-referencing, classification, narrative drafting. RAMS production, reminder emails, deal-to-job sync — all mechanical, no AI. We are not paying inference cost where a template merge would do.

iii.

Safe by design

Every AI output is captured in an audit trail. Every client document passes a coordinator before it's sent. Cost is capped daily, by code, with a circuit breaker. The fallback path means a vendor outage degrades to slower service, not silently to bad output.

§ 08 Entry & exit points · where work begins & ends

There are very few. That's a feature — it means the perimeter is small, security has fewer surfaces to defend, and a stakeholder asking "where does an audit go in?" has a one-sentence answer.

In

Entry · 1
iAuditor poll
Primary trigger. n8n polls the SafetyCulture API every 15 min for audits in the signed-off state.
Entry · 2
SharePoint drop-folder
Fallback. Admin manually exports the audit PDF; fallback-svc watches the folder.
Entry · 3 (future)
Webhook or approval-app upload
Open question — choice depends on where audits live operationally. 30-min scoping conversation, not implementation.

Out

Exit · 1
DOCX → SharePoint
Two Word documents per audit, dropped into /Clients/<Client>/<Year>/<Job>/.
Exit · 2
Approval list row
A coordinator sees a new row, opens the docs, edits, ticks approved.
Exit · 3
Coordinator sends to client
Manual step today. The coordinator is the final filter. Nothing leaves Altus without a name on it.
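
The deliverable location in Exit · 1 follows a fixed pattern, so it can be built mechanically. A minimal sketch, assuming the client and job names may contain characters that are awkward in folder names; the set of characters replaced and both function names are illustrative, not taken from docgen-svc.

```python
import re

# Characters avoided in folder names (a conservative assumption, not a rule from the source).
_UNSAFE = re.compile(r'["*:<>?/\\|#%]')


def safe_segment(name: str) -> str:
    """Replace characters that are unsafe in a folder name and trim whitespace."""
    cleaned = _UNSAFE.sub("-", name).strip()
    if not cleaned:
        raise ValueError("folder segment is empty after cleaning")
    return cleaned


def deliverable_folder(client: str, year: int, job: str) -> str:
    """Build /Clients/<Client>/<Year>/<Job>/ for one audit."""
    return f"/Clients/{safe_segment(client)}/{year}/{safe_segment(job)}/"
```

For example, `deliverable_folder("Acme Ltd", 2026, "JOB-0042")` returns `/Clients/Acme Ltd/2026/JOB-0042/`; the client and job values here are made up.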

§ 09 Total oversight · where to look

If a stakeholder asks "is the system working?" or "what did Claude actually do for that audit?" — here are the seven places to look. Each tells you something different; together they cover everything.

| Surface | What it tells you | Audience |
|---|---|---|
| n8n console (low-code workflow UI) | Every workflow execution, every node's inputs & outputs, success/failure per step. The first place to look on a "what just happened?" question. | Ops lead, dev partner |
| Azure Workbook ("Altus agent runs (stg)") | Cache hit rate, iteration distribution, runaway-loop detection, cost per day per path. Stage 3.5 deliverable, deployed via Bicep. | Dev partner, finance |
| SharePoint approval list ("Altus Audit Approvals") | Every audit waiting for a coordinator. Confidence score, document links, approver, approval date. | Coordinators, ops lead |
| approval-app / obs route (read-only API view) | Every Claude call: tokens, cost, skill version, iteration count, container, stop reason. Audit trail for the safety regulator angle. | Dev partner, compliance |
| Teams alerts channel (#altus-ops-alerts) | Heartbeat-driven alerts on iAuditor, Anthropic, Graph availability. Three-tier severity (info, warning, action). | Everyone |
| Anthropic console (platform.claude.com) | Workspace billing dashboard, Skills versioning UI, rate-limit status. | Dev partner |
| GitHub repos (altus + altus-safety-skills) | Every code & skill change as a reviewable PR. CI runs the eval against real Anthropic on every skill PR. | Dev partner, ops lead |

§ 10 Power-user handover · what an ops lead inherits

The system is built so that Altus's operations lead — not a software engineer — can keep it running, evolve the rules, and bring in a generic low-code developer if a bigger change is needed.

Day-1: they can

Operate without dev help

  • Open n8n, watch a workflow execute, re-run a failed step
  • Tick approval on a SharePoint list row, send DOCX to a client
  • Read the runbook + watch the Loom walkthrough
  • Triage a Teams alert (which integration is down? when was the last green heartbeat?)

Day-30: they can

Evolve the rules

  • Edit a Word template — new column, different heading — via git PR
  • Tweak a skill prompt (open SKILL.md in any editor; CI guards it)
  • Change a workflow's schedule, add a Slack notification, add a filter
  • Read the audit trail to explain a model decision

When a bigger change is needed

Bring in any low-code dev

The architecture is small enough that any competent low-code developer can pick it up. Skill markdown, n8n workflows, Word templates — all standard formats. Python services are 800-1500 LOC each.

The 30-day snagging window covers anything that breaks before the ops lead is comfortable.

§ 11 Extensibility · what changes cost what

The cheap changes are deliberate. Anything Altus is likely to want next quarter is a markdown edit, a Word template change, or an n8n workflow add. The expensive changes are the rare ones.

| Change | Where it happens | Effort |
|---|---|---|
| Tweak the AI prompt (e.g. soften narrative tone) | Edit SKILL.md in git, open PR, CI runs 6-fixture eval. Auto-syncs to Anthropic Platform on merge. | 15 min |
| Change a Word template (e.g. new client logo, new column) | Edit .docx in git, sample-render attached to the PR comment for review. | 30 min |
| Add a workflow (e.g. weekly summary email) | Build in n8n console: visual drag & drop, no code. | 1 hr |
| Add a new skill (e.g. RAMS-tone-checker) | Create new directory in skills repo with SKILL.md + 6 eval fixtures. CI uploads, runners discover it via skill ID. | 1 day |
| Swap inference provider (e.g. to Bedrock / OpenAI) | Fork agent-svc-skilled, replace the gateway, inline the skill body from git. Same /run contract for n8n. | 3 days |
| Change the cloud (e.g. Azure → AWS) | All Bicep would become Terraform/CDK; Container Apps become ECS/Fargate. Code unchanged. Architecture mirrors what Softkrtl already runs on AWS. | ~2 wks |
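
For the "add a new skill" row, a pre-flight check on the new directory can catch an incomplete skill before CI does. The sketch below is hypothetical: the source says a skill directory holds SKILL.md plus 6 eval fixtures, but the `fixtures/` subfolder, the `.json` extension, and the function name are assumptions about layout.

```python
from pathlib import Path

REQUIRED_FIXTURES = 6  # the 6-fixture eval mentioned in this document


def skill_problems(skill_dir: Path, fixtures_subdir: str = "fixtures") -> list:
    """Return human-readable problems with a skill directory; an empty list means it looks complete."""
    problems = []
    if not (skill_dir / "SKILL.md").is_file():
        problems.append("missing SKILL.md")
    # Assumed layout: one JSON file per eval fixture under <skill_dir>/fixtures/.
    fixtures_dir = skill_dir / fixtures_subdir
    fixtures = sorted(fixtures_dir.glob("*.json")) if fixtures_dir.is_dir() else []
    if len(fixtures) < REQUIRED_FIXTURES:
        problems.append(f"found {len(fixtures)} eval fixtures, expected {REQUIRED_FIXTURES}")
    return problems
```

A power user could run a check like this locally before opening the PR, so the CI eval spends API calls only on skills that are structurally complete.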

§ 12 Risks · eyes open

A short, honest list. Most are time-bounded or covered by the backup path. The two marked Watch are the genuine fronts to keep an eye on.

| Risk | What it means | Severity | Mitigation |
|---|---|---|---|
| Beta-API drift (Anthropic Skills + Sessions still beta) | The hosted-skills feature is in beta. Anthropic could change request shape on short notice. | Watch | Pinned SDK versions, release-note subscription. Skill body in git, swappable to inlined-Messages path in a few days if outage. |
| Single environment (staging IS production for PoC) | No separate altus-prod-rg. A breaking deploy in staging affects real audits once onboarded. | Known | Deploy script digest-pins every image; manual gate; rollback is a previous-revision activate. |
| Vendor coupling (Anthropic Platform for skills hosting) | Skills hosting only available from Anthropic. Provider change would require swapping. | Hedged | Skill source-of-truth is git, not Anthropic. Rebuild as Messages-API-only runner takes days, not weeks. |
| iAuditor API access (Premium tier may be required) | If API access proves harder than expected, real audits cannot trigger automatically. | Open | SharePoint drop-folder fallback already wired; manual entry path runs end-to-end identical. |
| Trial-clock expiries (SafetyCulture 2026-05-31, M365 2026-06-16) | Vendor trials end soon; renewal vs cancel decisions required. | Calendar | Already flagged on runbook; commercial conversation before each date. |
| Coordinator-edit rate (how often a human rewrites the AI) | If the rate stays above 20% after shadow period, model output isn't good enough and trust erodes. | Watch | Captured in reasoning_calls; eval corpus grows from every disagreement; human-in-loop always present. |
| ZDR not available (on Anthropic Skills) | Hosted skills are not Zero Data Retention eligible. Standard retention applies. | Accepted | Explicit user decision; revisit only if compliance posture changes. |
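
The coordinator-edit rate in the table above is a simple ratio, and the 20% line is easy to test against. A minimal sketch, assuming each approved document carries a yes/no flag for whether the coordinator rewrote the AI draft; how that flag is actually derived is not described in this document, and the function names are illustrative.

```python
EDIT_RATE_THRESHOLD = 0.20  # the 20% figure from the risk table


def edit_rate(approved_docs):
    """Share of approved documents that a coordinator rewrote before approval.

    `approved_docs` is a list of booleans, True where the coordinator edited the AI draft.
    """
    if not approved_docs:
        return 0.0
    return sum(approved_docs) / len(approved_docs)


def needs_attention(approved_docs):
    """True when the edit rate is strictly above the threshold."""
    return edit_rate(approved_docs) > EDIT_RATE_THRESHOLD
```

One edited document in four (25%) would cross the line, while one in five (exactly 20%) would not, because the risk is worded as staying above 20%.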